This document details visualization in anvi'o.

Anvi'o is run in a dedicated conda environment.

conda activate anvio-7.1

Get bin info

Convert the bin information into a format that anvi'o can import. This means concatenating the bin files for each binning method into a single list that records which contig or read belongs to which bin.

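library(stringr)  # provides str_replace()

# get all bin directories; the first entry is "../data/Bins" itself
path <- list.dirs("../data/Bins")

# loop over each binning method (entries 2 to 7 of path)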
for (i in 2:7){
  
  DF <- NULL
  
  pathname <- path[i]
  filelist <- list.files(paste0(pathname, "/"))
  
  # get list of all contigs and reads for all bins into 1 tsv file
  for (filename in filelist){
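    # read.csv() already returns a one-column data frame
    df <- read.csv(paste0(pathname, "/", filename), header = FALSE)
    colnames(df) <- "read"
    # the bin name is the file name with its first "." replaced by "_"
    df$bin <- str_replace(filename, "[.]", "_")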
    
    # change names if a number is at the beginning of the bin name
    if (basename(pathname) == "24_sample_bam_bins"){
      df$bin <- str_replace(df$bin, "24", "twentyfour")
    }
    if (basename(pathname) == "47_sample_bam_bins"){
      df$bin <- str_replace(df$bin, "47", "fortyseven")
    }
    
    DF <- rbind(DF, df) 
  }
  
  # write without header or row names: column 1 is the contig/read, column 2 is the bin
  write.table(DF, paste0("../output/all_bins/", basename(pathname), ".tsv"), row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")
  
}

Examine the TSV files.

# the TSV was written without a header row, so read it with header = FALSE and name the columns
tsv_output <- read.csv("../output/all_bins/assembly_bins.tsv", sep = "\t", header = FALSE, col.names = c("read", "bin"))
knitr::kable(head(tsv_output, 6))
read                       bin
MG1058_s821.ctg000852l     assembly_bin_1
MG1058_s1105.ctg001148l    assembly_bin_1
MG1058_s1585.ctg001645l    assembly_bin_1
MG1058_s1820.ctg001893l    assembly_bin_1
MG1058_s645.ctg000674l     assembly_bin_10
MG1058_s914.ctg000951l     assembly_bin_10
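
A quick sanity check of a collection file before import can catch formatting problems early. The sketch below assumes the two-column, tab-separated layout shown above and the naming convention used in this document (bin names made of letters, digits, and underscores, not starting with a digit). `check_collection_tsv` is a hypothetical helper, not part of anvi'o.

```shell
# Sketch: sanity-check a two-column collection TSV (contig/read, bin).
# check_collection_tsv is a hypothetical helper, not an anvi'o program.
check_collection_tsv() {
  local tsv="$1"
  awk -F '\t' '
    # every line needs exactly two tab-separated fields
    NF != 2 { printf "line %d: expected 2 columns, found %d\n", NR, NF; bad = 1; next }
    # bin names: letters, digits, underscores, and no leading digit
    $2 !~ /^[A-Za-z_][A-Za-z0-9_]*$/ { printf "line %d: unexpected bin name: %s\n", NR, $2; bad = 1 }
    # a contig or read should be assigned to one bin only
    seen[$1]++ { printf "line %d: listed more than once: %s\n", NR, $1; bad = 1 }
    END { exit bad + 0 }
  ' "$tsv"
}
```

Calling `check_collection_tsv ../output/all_bins/assembly_bins.tsv` prints one message per problem and returns a non-zero exit status if any were found.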

Import bins into anvi'o

Import each bin collection into the anvi'o profile database that was already created.

# Example import of one bin collection; change the input TSV and the -C collection name for each binning method
anvi-import-collection "./github/jordan-marinimicrobia/output/all_bins/short_reads_bam_bins.tsv" -p "./Downloads/plus_PROFILE.db" -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_plus/1058_P1_2018_585_0.2um_assembly_plus.db" --contigs-mode -C shortreads
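
Since the import is repeated once per binning method, the commands can be generated in a loop. The sketch below only prints the commands (a dry run) so they can be reviewed before running. The helper names and the rule that turns a file name into a collection name are assumptions for illustration; only `short_reads_bam_bins.tsv` mapping to `shortreads` comes from the example above.

```shell
# Sketch: print one anvi-import-collection command per collection TSV.
# collection_name and print_import_commands are hypothetical helpers.
collection_name() {
  # assumed naming rule: short_reads_bam_bins.tsv -> shortreads
  basename "$1" .tsv | sed -e 's/_bam_bins$//' -e 's/_bins$//' -e 's/_//g'
}

print_import_commands() {
  local bins_dir="$1" profile_db="$2" contigs_db="$3" tsv
  for tsv in "$bins_dir"/*.tsv; do
    [ -e "$tsv" ] || continue   # skip the literal pattern when nothing matches
    printf 'anvi-import-collection "%s" -p "%s" -c "%s" --contigs-mode -C %s\n' \
      "$tsv" "$profile_db" "$contigs_db" "$(collection_name "$tsv")"
  done
}
```

Run it as `print_import_commands <tsv directory> <profile db> <contigs db>`, check the collection names it proposes, and then paste or pipe the reviewed commands into a shell.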

Run interactive browser

anvi-interactive -p "./Downloads/plus_PROFILE.db" -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_plus/1058_P1_2018_585_0.2um_assembly_plus.db"

Example of what the interactive browser looks like with bins. Anvi'o calculates statistics such as completion and redundancy for each bin.

Anvi'o interactive browser

Refine bins

Dig into “contaminated” bins to see how and why they are contaminated. As a reminder, bin names in the anvi'o collection differ from the original file names: “.” is changed to “_”, and 24 and 47 are written out as “twentyfour” and “fortyseven”.

anvi-refine -p "./Downloads/assembly_PROFILE.db" -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_only/1058_P1_2018_585_0.2um_assembly.db" -C shortreads -b short_reads_bam_bin_163
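
To find the value to pass to `-b`, the renaming can be reproduced on the command line. This is a minimal sketch: `anvio_bin_name` is a hypothetical helper, and it assumes the 24 or 47 sits at the start of the bin file name (the R code replaces the first occurrence wherever it appears).

```shell
# Sketch: translate an original bin file name into the name stored in anvi'o.
# anvio_bin_name is a hypothetical helper mirroring the renaming rules above.
anvio_bin_name() {
  # first "." becomes "_"; a leading 24 or 47 is written out in words
  printf '%s\n' "$1" | sed -e 's/[.]/_/' -e 's/^24/twentyfour/' -e 's/^47/fortyseven/'
}
```

For example, `anvio_bin_name short_reads_bam_bin.163` gives `short_reads_bam_bin_163`, the bin name used in the command above.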

Example of a contaminated bin. The coverage is not consistent, the clustering that anvi'o uses to group sequences splits into many branches, and many single-copy core genes are duplicated.

Anvi'o interactive display of a contaminated bin

Get summary statistics

I used anvi-interactive to get a summary of all bins in each bin collection. The example output file below reports, for each bin, its size, number of contigs, N50, GC content, completion, redundancy (contamination), and taxonomy.

# named bin_summary so the data frame does not shadow base R's summary()
bin_summary <- read.table("../output/anvio_outputs/assembly_plus_summary.txt", sep = "\t", header = TRUE)
summary(bin_summary)
##      bins            total_length       num_contigs           N50         
##  Length:274         Min.   :  202581   Min.   :   1.00   Min.   :  10203  
##  Class :character   1st Qu.:  310692   1st Qu.:   9.00   1st Qu.:  12386  
##  Mode  :character   Median :  488296   Median :  23.00   Median :  14620  
##                     Mean   :  879731   Mean   :  52.00   Mean   :  74294  
##                     3rd Qu.:  896170   3rd Qu.:  50.75   3rd Qu.:  50200  
##                     Max.   :20079188   Max.   :1582.00   Max.   :3034959  
##    GC_content    percent_completion percent_redundancy   t_domain        
##  Min.   :26.47   Min.   :  0.00     Min.   :   0.00    Length:274        
##  1st Qu.:38.99   1st Qu.:  0.00     1st Qu.:   0.00    Class :character  
##  Median :45.84   Median :  0.00     Median :   0.00    Mode  :character  
##  Mean   :47.44   Mean   : 13.16     Mean   :  16.19                      
##  3rd Qu.:57.02   3rd Qu.: 23.59     3rd Qu.:   0.00                      
##  Max.   :69.50   Max.   :100.00     Max.   :2053.52                      
##    t_phylum           t_class            t_order            t_family        
##  Length:274         Length:274         Length:274         Length:274        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    t_genus           t_species        
##  Length:274         Length:274        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
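
The summary table can also be filtered from the command line to shortlist bins worth refining. The sketch below assumes the file is tab-separated with a header row containing the `bins`, `percent_completion`, and `percent_redundancy` columns seen above. `filter_bins` is a hypothetical helper, and the thresholds are passed as arguments because this document does not fix any cutoffs.

```shell
# Sketch: list bins that meet completion and redundancy cutoffs.
# filter_bins is a hypothetical helper; cutoffs are caller-supplied.
filter_bins() {
  local summary="$1" min_completion="$2" max_redundancy="$3"
  awk -F '\t' -v minc="$min_completion" -v maxr="$max_redundancy" '
    NR == 1 {
      # map header names to column positions so column order does not matter
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("bins" in col) || !("percent_completion" in col) || !("percent_redundancy" in col)) exit 2
      next
    }
    ($(col["percent_completion"]) + 0 >= minc + 0) && ($(col["percent_redundancy"]) + 0 <= maxr + 0) {
      printf "%s\t%s\t%s\n", $(col["bins"]), $(col["percent_completion"]), $(col["percent_redundancy"])
    }
  ' "$summary"
}
```

For instance, `filter_bins ../output/anvio_outputs/assembly_plus_summary.txt 70 10` would print the name, completion, and redundancy of every bin that is at least 70% complete with at most 10% redundancy; both numbers are illustrative only.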

Creating pangenome

Create anvi'o contigs databases for my bins and annotate them with COGs, KEGG KOfams, HMMs, and tRNAs. Use the interactive display to visualize the pangenome and find variable regions of the Sulfitobacter genome. I will dive into these regions further in the next section.

anvi-gen-contigs-database -f sulf_genomes/assembly_plus_bin_4.fa -o sulf_genomes/dbs/sulfbin4.db

anvi-run-hmms -c sulf_genomes/dbs/sulfbin4.db
anvi-run-scg-taxonomy -c sulf_genomes/dbs/sulfbin4.db
anvi-scan-trnas -c sulf_genomes/dbs/sulfbin4.db
anvi-run-ncbi-cogs -c sulf_genomes/dbs/sulfbin4.db
anvi-run-kegg-kofams -c sulf_genomes/dbs/sulfbin4.db
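
The same annotation steps apply to every genome in the pangenome, so the commands can be generated per FASTA file. The sketch below is a dry run that only prints them. `print_annotation_commands` is a hypothetical helper, and it assumes each database is named after its FASTA file, which differs from the hand-picked name `sulfbin4.db` used above.

```shell
# Sketch: print the database-creation and annotation commands for each genome.
# print_annotation_commands is a hypothetical helper; nothing is executed.
print_annotation_commands() {
  local fasta_dir="$1" db_dir="$2" fa db tool
  for fa in "$fasta_dir"/*.fa; do
    [ -e "$fa" ] || continue   # skip the literal pattern when nothing matches
    # assumed naming: <fasta name>.db inside db_dir
    db="$db_dir/$(basename "$fa" .fa).db"
    printf 'anvi-gen-contigs-database -f "%s" -o "%s"\n' "$fa" "$db"
    for tool in anvi-run-hmms anvi-run-scg-taxonomy anvi-scan-trnas \
                anvi-run-ncbi-cogs anvi-run-kegg-kofams; do
      printf '%s -c "%s"\n' "$tool" "$db"
    done
  done
}
```

`print_annotation_commands sulf_genomes sulf_genomes/dbs` prints six commands per genome, in the same order as the single-genome example above.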

anvi-gen-genomes-storage -e sulf-external-genomes.txt \
                         -o sulf-GENOMES.db
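
The file passed to `-e` is a tab-separated table with a `name` column and a `contigs_db_path` column, one row per genome. The sketch below builds such a file from a directory of contigs databases. `write_external_genomes` is a hypothetical helper, and replacing every character other than letters, digits, and underscores in the names is a conservative assumption rather than a documented requirement quoted here.

```shell
# Sketch: build an external-genomes file from a directory of contigs databases.
# write_external_genomes is a hypothetical helper.
write_external_genomes() {
  local db_dir="$1" out="$2" db name
  printf 'name\tcontigs_db_path\n' > "$out"
  for db in "$db_dir"/*.db; do
    [ -e "$db" ] || continue   # skip the literal pattern when nothing matches
    # genome name = file name without .db, with other characters turned into "_"
    name=$(basename "$db" .db | tr -c 'A-Za-z0-9_\n' '_')
    printf '%s\t%s\n' "$name" "$db" >> "$out"
  done
}
```

For example, `write_external_genomes sulf_genomes/dbs sulf-external-genomes.txt` would list every `.db` file in that directory, including `sulfbin4.db` under the name `sulfbin4`.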

anvi-pan-genome -g sulf-GENOMES.db -n sulfitobacter

anvi-display-pan -g sulf-GENOMES.db -p sulfitobacter/sulfitobacter-PAN.db

Pangenome visualization.

Sulfitobacter pangenome